import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
data=pd.read_csv(r'S:\Data sets\EVDatabase 2023.csv') # raw string so the backslashes in the Windows path are not treated as escapes
data.head() # checking the top 5 values to get the overview of data.
| | Name | Subtitle | Acceleration | TopSpeed | Range | Efficiency | FastChargeSpeed | Drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lucid Air Dream Edition P | 118 kWh useable battery Available sin... | 2.7 sec | 270 km/h | 645 km | 183 Wh/km | 820 km/h | All Wheel Drive | 5 | €218,000 | NaN |
| 1 | Porsche Taycan Turbo S | 83.7 kWh useable battery Available si... | 2.8 sec | 260 km/h | 400 km | 209 Wh/km | 980 km/h | All Wheel Drive | 4 | €189,668 | £142,400 |
| 2 | Audi e-tron GT RS | 85 kWh useable battery Available sinc... | 3.3 sec | 250 km/h | 405 km | 210 Wh/km | 1000 km/h | All Wheel Drive | 4 | €146,050 | £115,000 |
| 3 | Renault Zoe ZE50 R110 | 52 kWh useable battery Available sinc... | 11.4 sec | 135 km/h | 315 km | 165 Wh/km | 230 km/h | Front Wheel Drive | 5 | €36,840 | NaN |
| 4 | Audi Q4 e-tron 35 | 52 kWh useable battery Available sinc... | 9.0 sec | 160 km/h | 285 km | 182 Wh/km | 360 km/h | Rear Wheel Drive | 5 | NaN | NaN |
data.shape
(309, 11)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Name             309 non-null    object
 1   Subtitle         309 non-null    object
 2   Acceleration     309 non-null    object
 3   TopSpeed         309 non-null    object
 4   Range            309 non-null    object
 5   Efficiency       309 non-null    object
 6   FastChargeSpeed  309 non-null    object
 7   Drive            309 non-null    object
 8   NumberofSeats    309 non-null    int64
 9   PriceinGermany   282 non-null    object
 10  PriceinUK        198 non-null    object
dtypes: int64(1), object(10)
memory usage: 26.7+ KB
data.isna().sum() #counting the null values in each column
Name                 0
Subtitle             0
Acceleration         0
TopSpeed             0
Range                0
Efficiency           0
FastChargeSpeed      0
Drive                0
NumberofSeats        0
PriceinGermany      27
PriceinUK          111
dtype: int64
sns.heatmap(data.isna()) # visually checking the null values
# Observing the Name column, we conclude that the first word in the string is the company (brand) name, followed by the model name
# The brand value of the company also affects the market price
data.columns
Index(['Name', 'Subtitle', 'Acceleration', 'TopSpeed', 'Range', 'Efficiency',
'FastChargeSpeed', 'Drive', 'NumberofSeats', 'PriceinGermany',
'PriceinUK'],
dtype='object')
#creating a new dataframe 'df' to store Brand name
df = pd.DataFrame({'Brand': data['Name'].str.split().str[0]}) #making dictionary to store values
df
| | Brand |
|---|---|
| 0 | Lucid |
| 1 | Porsche |
| 2 | Audi |
| 3 | Renault |
| 4 | Audi |
| ... | ... |
| 304 | Volkswagen |
| 305 | Volkswagen |
| 306 | Polestar |
| 307 | Polestar |
| 308 | Maserati |
309 rows × 1 columns
data.Subtitle.value_counts()
68 kWh useable battery Available since February 2021 8
46.3 kWh useable battery Available since November 2020 8
77 kWh useable battery Available since March 2022 5
46.3 kWh useable battery Available since November 2021 5
50.8 kWh useable battery Expected from May 2023 4
..
49 kWh useable battery Available since November 2021 1
107.8 kWh useable battery Available since February 2022 1
40 kWh useable battery Available since April 2022 1
74 kWh useable battery Available since September 2021 1
95 kWh useable battery Expected from September 2023 1
Name: Subtitle, Length: 187, dtype: int64
# Each subtitle contains two statements
# The 1st is about the usable battery capacity
# The 2nd is about the availability of the vehicle in the market
df['Batery_KWh']=data.Subtitle.str.split().str[0]
df['Batery_KWh'].head() # 'Batery_KWh' holds the usable battery capacity (kWh) of each vehicle
0     118
1    83.7
2      85
3      52
4      52
Name: Batery_KWh, dtype: object
data.Subtitle.head() # Comparing the top 5 values for verification
0    118 kWh useable battery Available sin...
1    83.7 kWh useable battery Available si...
2    85 kWh useable battery Available sinc...
3    52 kWh useable battery Available sinc...
4    52 kWh useable battery Available sinc...
Name: Subtitle, dtype: object
kWh stands for kilowatt-hour, a unit of energy: the amount of energy used by a device that draws one kilowatt of power for one hour. For electric vehicles (EVs), kWh measures the capacity of the battery, that is, how much energy it can store. All else being equal, the higher the battery capacity, the further the EV can travel on a single charge. Energy consumption is the energy the vehicle uses to travel a given distance. It is commonly quoted in kWh/100 km; this dataset reports it in Wh/km in the `Efficiency` column. The lower the consumption, the more efficient the EV.
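As a rough illustration of how these units relate, the sketch below converts Wh/km to kWh/100 km and estimates range as battery capacity divided by consumption. The two helper functions are hypothetical names introduced only for this example; the input values are taken from row 0 of the dataset.

```python
def wh_per_km_to_kwh_per_100km(wh_per_km: float) -> float:
    """Convert consumption from Wh/km to kWh/100 km."""
    return wh_per_km * 100 / 1000


def estimated_range_km(battery_kwh: float, wh_per_km: float) -> float:
    """Estimate range as usable energy (in Wh) divided by consumption (Wh/km)."""
    return battery_kwh * 1000 / wh_per_km


# Values from row 0 of the dataset (Lucid Air Dream Edition P).
battery_kwh = 118.0
consumption_wh_per_km = 183.0

consumption_100km = wh_per_km_to_kwh_per_100km(consumption_wh_per_km)
range_km = estimated_range_km(battery_kwh, consumption_wh_per_km)
print(f"{consumption_100km:.1f} kWh/100 km, about {range_km:.0f} km of range")
```

The estimate comes out at about 645 km, which agrees with the `Range` value listed for that row.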
data.head(2)
| | Name | Subtitle | Acceleration | TopSpeed | Range | Efficiency | FastChargeSpeed | Drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lucid Air Dream Edition P | 118 kWh useable battery Available sin... | 2.7 sec | 270 km/h | 645 km | 183 Wh/km | 820 km/h | All Wheel Drive | 5 | €218,000 | NaN |
| 1 | Porsche Taycan Turbo S | 83.7 kWh useable battery Available si... | 2.8 sec | 260 km/h | 400 km | 209 Wh/km | 980 km/h | All Wheel Drive | 4 | €189,668 | £142,400 |
df['accln_sec']= data['Acceleration'].str.split().str[0]
df['TopSpeed_km/h']= data['TopSpeed'].str.split().str[0]
df['range_km']= data['Range'].str.split().str[0]
df['efficiency_wh/km']= data['Efficiency'].str.split().str[0]
df['ChargeSpeed_km/hr']= data['FastChargeSpeed'].str.split().str[0]
df['drive']= data['Drive']
df['NumberofSeats']= data['NumberofSeats']
df['PriceinGermany']= data['PriceinGermany'].str.replace('€','',regex=True)
df['PriceinGermany']= df['PriceinGermany'].str.replace(',','',regex=True)
df['PriceinUK']= data['PriceinUK'].str.replace('£','',regex=True)
df['PriceinUK']= df['PriceinUK'].str.replace(',','',regex=True)
df.head()
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lucid | 118 | 2.7 | 270 | 645 | 183 | 820 | All Wheel Drive | 5 | 218000 | NaN |
| 1 | Porsche | 83.7 | 2.8 | 260 | 400 | 209 | 980 | All Wheel Drive | 4 | 189668 | 142400 |
| 2 | Audi | 85 | 3.3 | 250 | 405 | 210 | 1000 | All Wheel Drive | 4 | 146050 | 115000 |
| 3 | Renault | 52 | 11.4 | 135 | 315 | 165 | 230 | Front Wheel Drive | 5 | 36840 | NaN |
| 4 | Audi | 52 | 9.0 | 160 | 285 | 182 | 360 | Rear Wheel Drive | 5 | NaN | NaN |
df.shape
(309, 11)
df.info() # checking the dtypes of data frame df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Brand              309 non-null    object
 1   Batery_KWh         309 non-null    object
 2   accln_sec          309 non-null    object
 3   TopSpeed_km/h      309 non-null    object
 4   range_km           309 non-null    object
 5   efficiency_wh/km   309 non-null    object
 6   ChargeSpeed_km/hr  309 non-null    object
 7   drive              309 non-null    object
 8   NumberofSeats      309 non-null    int64
 9   PriceinGermany     282 non-null    object
 10  PriceinUK          198 non-null    object
dtypes: int64(1), object(10)
memory usage: 26.7+ KB
# The columns 'PriceinGermany' and 'PriceinUK' should be integers.
# The null values have to be filled before changing the data types.
# Filling null values with 0
df['PriceinUK'] = df['PriceinUK'].fillna(0)
df['PriceinGermany']= df['PriceinGermany'].fillna(0)
# Changing the datatype from object to int
df['PriceinUK']= df['PriceinUK'].astype(int)
df['PriceinGermany']= df['PriceinGermany'].astype(int)
# Changing the data types of the remaining columns
df['Batery_KWh'] = df['Batery_KWh'].astype(float) # this column has decimal values, so we convert it to float
df['accln_sec'] = df['accln_sec'].astype(float) # this column has decimal values, so we convert it to float
df['range_km'] = df['range_km'].astype(int)
df['efficiency_wh/km'] = df['efficiency_wh/km'].astype(int)
df['TopSpeed_km/h'] = df['TopSpeed_km/h'].astype(int)
df['ChargeSpeed_km/hr'].unique() # there is '-' value which needs to be fixed for changing the data type
array(['820', '980', '1000', '230', '360', '640', '1050', '1020', '-',
'180', '970', '1130', '310', '150', '730', '380', '440', '530',
'500', '540', '390', '370', '430', '520', '840', '960', '290',
'280', '1010', '1170', '790', '770', '650', '630', '260', '160',
'1030', '220', '950', '480', '680', '330', '720', '750', '1070',
'900', '510', '490', '470', '460', '570', '670', '320', '400',
'780', '410', '190', '560', '1110', '420', '300', '710', '930',
'1090', '910', '1060', '450', '990', '170', '240', '760', '210',
'870', '920', '880', '340', '1240', '1150', '860', '580', '350',
'200', '610', '700', '940', '600', '590', '550', '620', '850',
'800'], dtype=object)
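df['ChargeSpeed_km/hr'] = df['ChargeSpeed_km/hr'].str.replace('-', '').str[0].fillna(0).astype(int)
# '-' becomes 0 as a temporary placeholder.
# Note: .str[0] keeps only the first character, so '820' is stored as 8 and '1000' as 1; the values shown below are these leading digits.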
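If the full charge-speed values are wanted rather than their first digit, one possible alternative is `pd.to_numeric` with `errors='coerce'`. This is only a sketch on a small sample of strings taken from the `unique()` output above, not the conversion the notebook's later outputs were produced with.

```python
import pandas as pd

# A few raw strings as they appear in 'ChargeSpeed_km/hr', including the '-' marker.
raw = pd.Series(['820', '980', '1000', '-', '230'])

# errors='coerce' turns anything non-numeric (here '-') into NaN
# instead of raising, and keeps every digit of the valid values.
charge_speed = pd.to_numeric(raw, errors='coerce')
print(charge_speed.tolist())
```

Keeping the missing entry as NaN, rather than 0, also means it is ignored by `mean()` if a brand average is computed later.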
df.dtypes # checking the datatypes
Brand                 object
Batery_KWh           float64
accln_sec            float64
TopSpeed_km/h          int32
range_km               int32
efficiency_wh/km       int32
ChargeSpeed_km/hr      int32
drive                 object
NumberofSeats          int64
PriceinGermany         int32
PriceinUK              int32
dtype: object
df.head()
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lucid | 118.0 | 2.7 | 270 | 645 | 183 | 8 | All Wheel Drive | 5 | 218000 | 0 |
| 1 | Porsche | 83.7 | 2.8 | 260 | 400 | 209 | 9 | All Wheel Drive | 4 | 189668 | 142400 |
| 2 | Audi | 85.0 | 3.3 | 250 | 405 | 210 | 1 | All Wheel Drive | 4 | 146050 | 115000 |
| 3 | Renault | 52.0 | 11.4 | 135 | 315 | 165 | 2 | Front Wheel Drive | 5 | 36840 | 0 |
| 4 | Audi | 52.0 | 9.0 | 160 | 285 | 182 | 3 | Rear Wheel Drive | 5 | 0 | 0 |
df[df['ChargeSpeed_km/hr']==0]
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Smart | 16.7 | 11.6 | 130 | 100 | 167 | 0 | Rear Wheel Drive | 2 | 21940 | 0 |
| 10 | Smart | 16.7 | 11.9 | 130 | 95 | 176 | 0 | Rear Wheel Drive | 2 | 25200 | 0 |
| 17 | Renault | 21.3 | 12.6 | 135 | 130 | 164 | 0 | Rear Wheel Drive | 4 | 28000 | 0 |
df['ChargeSpeed_km/hr']=df.groupby('Brand')['ChargeSpeed_km/hr'].transform(lambda x:x.replace(0,x.mean()))
df[df['ChargeSpeed_km/hr']==0] # no zeros remain; each was replaced by its brand mean (computed with the zeros included)
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
df.Brand.unique()
array(['Lucid', 'Porsche', 'Audi', 'Renault', 'Tesla', 'Smart', 'Honda',
'Mercedes', 'Lexus', 'BMW', 'Fiat', 'Skoda', 'Nissan',
'Volkswagen', 'Citroen', 'Opel', 'Peugeot', 'Mini', 'JAC',
'Hyundai', 'Kia', 'MG', 'Toyota', 'CUPRA', 'Subaru', 'SsangYong',
'Genesis', 'Aiways', 'Mazda', 'Dacia', 'Fisker', 'Hongqi', 'NIO',
'Ford', 'Polestar', 'Rolls-Royce', 'Lotus', 'Volvo', 'ORA', 'BYD',
'Abarth', 'DS', 'Jeep', 'Maserati', 'Seres', 'VinFast', 'Jaguar',
'XPENG', 'Maxus'], dtype=object)
df.Brand.nunique() #counting total brands
49
df.Brand.value_counts()
Mercedes       38
Porsche        18
Audi           16
Peugeot        15
Volkswagen     15
Opel           14
Citroen        12
BMW            10
Tesla          10
Toyota         10
Hyundai        10
Fiat           10
Kia             9
Skoda           8
Polestar        8
Volvo           8
MG              8
Renault         7
Nissan          6
NIO             6
XPENG           6
VinFast         6
Genesis         5
Ford            5
Lucid           5
Fisker          4
Smart           4
CUPRA           3
ORA             3
BYD             3
Aiways          2
Dacia           2
Hongqi          2
Mini            2
Lexus           2
Maserati        2
Lotus           2
Jeep            2
DS              1
Seres           1
Jaguar          1
Subaru          1
Abarth          1
Rolls-Royce     1
Mazda           1
SsangYong       1
JAC             1
Honda           1
Maxus           1
Name: Brand, dtype: int64
df.drive.nunique()
3
df.drive.value_counts()
All Wheel Drive      128
Front Wheel Drive    115
Rear Wheel Drive      66
Name: drive, dtype: int64
df.NumberofSeats.nunique()
4
df.NumberofSeats.value_counts()
5    203
7     66
4     38
2      2
Name: NumberofSeats, dtype: int64
# Checking whether the gap between 'PriceinGermany' (EUR) and 'PriceinUK' (GBP) could be explained by the currency difference alone
(df.PriceinGermany - df.PriceinUK).value_counts()
0 18
5000 6
12990 4
7000 3
7495 3
..
209 1
28214 1
28166 1
1780 1
-135000 1
Length: 263, dtype: int64
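The price differences vary widely from model to model, so they are not explained by the relative strength of the two currencies alone. Note that missing prices are still stored as 0 at this point, so the raw differences are only a rough check.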
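One way to make this check more direct is to look at the ratio of the two prices instead of their difference: if the gap were purely an exchange-rate effect, the ratio would be roughly the same for every model. The sketch below uses three rows whose prices appear in this notebook's outputs; the column name `uk_to_de_ratio` is introduced only for this example.

```python
import pandas as pd

# Prices for three models as listed in the notebook's outputs.
prices = pd.DataFrame({
    'Brand': ['Porsche', 'Audi', 'Rolls-Royce'],
    'PriceinGermany': [189668, 146050, 400000],  # EUR
    'PriceinUK': [142400, 115000, 350000],       # GBP
})

# A pure currency effect would give (nearly) the same ratio on every row.
prices['uk_to_de_ratio'] = prices['PriceinUK'] / prices['PriceinGermany']
print(prices[['Brand', 'uk_to_de_ratio']].round(3))
```

The ratio differs noticeably between these rows, which supports the conclusion that pricing differs by model and market rather than by currency alone.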
df.describe().T # analysing the values
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Batery_KWh | 309.0 | 71.076052 | 21.139409 | 16.7 | 56.0 | 74.0 | 85.0 | 123.0 |
| accln_sec | 309.0 | 7.532039 | 3.127109 | 2.1 | 4.9 | 7.0 | 9.0 | 19.1 |
| TopSpeed_km/h | 309.0 | 180.754045 | 38.738187 | 125.0 | 150.0 | 180.0 | 200.0 | 320.0 |
| range_km | 309.0 | 361.440129 | 110.861212 | 95.0 | 275.0 | 370.0 | 440.0 | 685.0 |
| efficiency_wh/km | 309.0 | 199.430421 | 33.021626 | 150.0 | 174.0 | 192.0 | 214.0 | 295.0 |
| ChargeSpeed_km/hr | 309.0 | 4.295192 | 2.249713 | 1.0 | 2.0 | 4.0 | 6.0 | 9.0 |
| NumberofSeats | 309.0 | 5.284790 | 0.978567 | 2.0 | 5.0 | 5.0 | 5.0 | 7.0 |
| PriceinGermany | 309.0 | 64832.462783 | 44638.555529 | 0.0 | 42000.0 | 56942.0 | 72519.0 | 400000.0 |
| PriceinUK | 309.0 | 41173.233010 | 43035.103703 | 0.0 | 0.0 | 37000.0 | 57990.0 | 350000.0 |
The minimum price is 0 because the NaN values were replaced with 0 in the steps above. These zeros are now replaced with the mean price of the same brand.
#
df['PriceinGermany']=df.groupby('Brand')['PriceinGermany'].transform(lambda x:x.replace(0,x.mean()))
(df['PriceinGermany']==0).sum() # earlier there were 27 Null values
11
df['PriceinUK']=df.groupby('Brand')['PriceinUK'].transform(lambda x:x.replace(0,x.mean()))
(df['PriceinUK']==0).sum() # earlier there were 111 Null values
32
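In the brand-level replacement above, `x.mean()` is computed over the zero placeholders as well, which pulls the imputed price down. A possible alternative, sketched here on hypothetical toy prices, keeps missing values as NaN so that `mean()` skips them. Brands with no known price at all stay NaN, which plays the same role as the remaining zeros above.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: brand A has one missing value, brand B has no known price.
toy = pd.DataFrame({
    'Brand': ['A', 'A', 'A', 'B'],
    'Price': [30000.0, 50000.0, np.nan, np.nan],
})

# Mean of the known values per brand; NaN entries are ignored by mean().
brand_mean = toy.groupby('Brand')['Price'].transform('mean')
toy['Price_filled'] = toy['Price'].fillna(brand_mean)

# For comparison: the brand mean when missing values are stored as 0.
zero_based = toy['Price'].fillna(0).groupby(toy['Brand']).transform('mean')

print(toy)
print(zero_based.round(2).tolist())
```

For brand A the NaN-based mean is 40000, while the zero-based mean drops to about 26667.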
df[df['PriceinGermany']==0].head(11) # this data can be used for prediction
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | JAC | 39.0 | 12.0 | 132 | 225 | 173 | 1.0 | Front Wheel Drive | 5 | 0.0 | 0.0 |
| 214 | Hongqi | 76.5 | 6.5 | 200 | 305 | 251 | 4.0 | All Wheel Drive | 5 | 0.0 | 0.0 |
| 215 | Hongqi | 90.0 | 4.9 | 200 | 355 | 254 | 4.0 | All Wheel Drive | 5 | 0.0 | 0.0 |
| 271 | Seres | 51.0 | 8.9 | 155 | 260 | 196 | 2.0 | Front Wheel Drive | 5 | 0.0 | 0.0 |
| 287 | XPENG | 82.7 | 6.7 | 200 | 500 | 165 | 7.0 | Rear Wheel Drive | 5 | 0.0 | 0.0 |
| 288 | XPENG | 82.7 | 4.1 | 200 | 470 | 176 | 6.0 | All Wheel Drive | 5 | 0.0 | 0.0 |
| 289 | XPENG | 82.7 | 4.1 | 200 | 465 | 178 | 6.0 | All Wheel Drive | 5 | 0.0 | 0.0 |
| 290 | XPENG | 75.0 | 6.4 | 200 | 375 | 200 | 7.0 | Rear Wheel Drive | 5 | 0.0 | 0.0 |
| 291 | XPENG | 94.0 | 6.4 | 200 | 470 | 200 | 9.0 | Rear Wheel Drive | 5 | 0.0 | 0.0 |
| 292 | XPENG | 94.0 | 3.9 | 200 | 440 | 214 | 9.0 | All Wheel Drive | 5 | 0.0 | 0.0 |
| 302 | Maxus | 90.0 | 9.2 | 180 | 345 | 261 | 2.0 | Front Wheel Drive | 7 | 0.0 | 64306.0 |
case1=df[df['PriceinGermany']==0]
case1.shape
(11, 11)
case1.head()
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | JAC | 39.0 | 12.0 | 132 | 225 | 173 | 1.0 | Front Wheel Drive | 5 | 0.0 | 0.0 |
| 214 | Hongqi | 76.5 | 6.5 | 200 | 305 | 251 | 4.0 | All Wheel Drive | 5 | 0.0 | 0.0 |
| 215 | Hongqi | 90.0 | 4.9 | 200 | 355 | 254 | 4.0 | All Wheel Drive | 5 | 0.0 | 0.0 |
| 271 | Seres | 51.0 | 8.9 | 155 | 260 | 196 | 2.0 | Front Wheel Drive | 5 | 0.0 | 0.0 |
| 287 | XPENG | 82.7 | 6.7 | 200 | 500 | 165 | 7.0 | Rear Wheel Drive | 5 | 0.0 | 0.0 |
df = df[df['PriceinGermany'] != 0] # Removal of null values
case2=df[df['PriceinUK']==0]
case2.shape
(22, 11)
case2.head() #will be used for prediction
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lucid | 118.0 | 2.7 | 270 | 645 | 183 | 8.0 | All Wheel Drive | 5 | 218000.0 | 0.0 |
| 31 | Lucid | 112.0 | 3.2 | 270 | 665 | 168 | 8.0 | All Wheel Drive | 5 | 174500.0 | 0.0 |
| 32 | Lucid | 88.0 | 3.4 | 250 | 550 | 160 | 9.0 | All Wheel Drive | 5 | 120000.0 | 0.0 |
| 33 | Lucid | 88.0 | 4.2 | 200 | 560 | 157 | 9.0 | Rear Wheel Drive | 5 | 100000.0 | 0.0 |
| 122 | SsangYong | 56.0 | 8.5 | 156 | 290 | 193 | 3.0 | Front Wheel Drive | 5 | 40490.0 | 0.0 |
df = df[df['PriceinUK'] != 0]# Removal of null values
df.shape
(276, 11)
# Most and least luxurious cars based on 'PriceinGermany'
most_luxurious = df.sort_values(by='PriceinGermany', ascending=False).head(1)
most_affordable = df.sort_values(by='PriceinGermany').head(1)
most_luxurious
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 233 | Rolls-Royce | 100.0 | 4.5 | 250 | 455 | 220 | 5.0 | All Wheel Drive | 4 | 400000.0 | 350000.0 |
most_affordable
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Smart | 16.7 | 11.6 | 130 | 100 | 167 | 2.25 | Rear Wheel Drive | 2 | 21940.0 | 19250.0 |
# Cars with the largest and smallest battery capacity based on 'Batery_KWh'
most_powerful_battery = df.sort_values(by='Batery_KWh', ascending=False).head(1)
least_powerful_battery = df.sort_values(by='Batery_KWh').head(1)
most_powerful_battery
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 175 | Mercedes | 108.4 | 4.6 | 210 | 485 | 224 | 6.0 | All Wheel Drive | 7 | 135434.0 | 139170.0 |
least_powerful_battery
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Smart | 16.7 | 11.6 | 130 | 100 | 167 | 2.25 | Rear Wheel Drive | 2 | 21940.0 | 19250.0 |
# Fastest and slowest acceleration cars based on 'accln_sec'
fastest_acceleration = df.sort_values(by='accln_sec').head(1)
slowest_acceleration = df.sort_values(by='accln_sec', ascending=False).head(1)
fastest_acceleration
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 52 | Tesla | 95.0 | 2.1 | 282 | 550 | 173 | 7.0 | All Wheel Drive | 5 | 137990.0 | 125000.0 |
slowest_acceleration
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 48 | Peugeot | 68.0 | 14.3 | 130 | 260 | 262 | 2.0 | Front Wheel Drive | 7 | 60430.0 | 23311.666667 |
# Fastest and slowest cars based on 'TopSpeed_km/h'
fastest_car = df.sort_values(by='TopSpeed_km/h', ascending=False).head(1) # highest top speed
slowest_car = df.sort_values(by='TopSpeed_km/h').head(1) # lowest top speed
fastest_car
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 270 | Maserati | 83.0 | 2.7 | 320 | 425 | 195 | 9.0 | All Wheel Drive | 4 | 250000.0 | 200000.0 |
slowest_car
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 127 | Citroen | 68.0 | 13.3 | 130 | 265 | 257 | 2.0 | Front Wheel Drive | 7 | 57940.0 | 21883.75 |
# Cars with the longest and shortest range based on 'range_km'
longest_range = df.sort_values(by='range_km', ascending=False).head(1)
shortest_range = df.sort_values(by='range_km').head(1)
longest_range
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 73 | Mercedes | 107.8 | 6.2 | 210 | 640 | 168 | 9.0 | Rear Wheel Drive | 5 | 109551.0 | 105610.0 |
shortest_range
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | Smart | 16.7 | 11.9 | 130 | 95 | 176 | 2.25 | Rear Wheel Drive | 2 | 25200.0 | 19250.0 |
# Cars with the highest and lowest efficiency based on 'efficiency_wh/km'
highest_efficiency = df.sort_values(by='efficiency_wh/km').head(1)
lowest_efficiency = df.sort_values(by='efficiency_wh/km', ascending=False).head(1)
highest_efficiency
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | Hyundai | 54.0 | 8.8 | 185 | 360 | 150 | 8.0 | Rear Wheel Drive | 5 | 43900.0 | 38214.0 |
lowest_efficiency
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | Mercedes | 90.0 | 12.1 | 160 | 305 | 295 | 3.0 | Front Wheel Drive | 7 | 72519.0 | 44560.078947 |
# Cars with the fastest and slowest charge speed based on 'ChargeSpeed_km/hr'
# (this column holds only the leading digit of the original value because of the .str[0] conversion, so the ranking is indicative at best)
fastest_charge_speed = df.sort_values(by='ChargeSpeed_km/hr', ascending=False).head(1) # highest value
slowest_charge_speed = df.sort_values(by='ChargeSpeed_km/hr').head(1) # lowest value
fastest_charge_speed
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Porsche | 83.7 | 2.8 | 260 | 400 | 209 | 9.0 | All Wheel Drive | 4 | 189668.0 | 142400.0 |
slowest_charge_speed
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 49 | Porsche | 71.0 | 5.4 | 230 | 410 | 173 | 1.0 | Rear Wheel Drive | 4 | 88399.0 | 75500.0 |
# Cars with the most and least number of seats based on 'NumberofSeats'
most_seats = df.sort_values(by='NumberofSeats', ascending=False).head(10)
least_seats = df.sort_values(by='NumberofSeats').head(2)
most_seats
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 144 | Mercedes | 60.0 | 12.0 | 160 | 205 | 293 | 3.0 | Front Wheel Drive | 7 | 61571.0 | 44560.078947 |
| 126 | Citroen | 46.3 | 12.1 | 130 | 180 | 257 | 2.0 | Front Wheel Drive | 7 | 52730.0 | 35995.000000 |
| 110 | Mercedes | 66.5 | 8.0 | 160 | 335 | 199 | 4.0 | All Wheel Drive | 7 | 55519.0 | 55310.000000 |
| 105 | Citroen | 46.3 | 11.7 | 135 | 200 | 232 | 3.0 | Front Wheel Drive | 7 | 43640.0 | 32495.000000 |
| 204 | Fiat | 46.3 | 12.1 | 130 | 185 | 250 | 2.0 | Front Wheel Drive | 7 | 55990.0 | 6429.000000 |
| 103 | Opel | 46.3 | 11.7 | 135 | 200 | 232 | 3.0 | Front Wheel Drive | 7 | 44750.0 | 34635.000000 |
| 102 | Opel | 46.3 | 11.7 | 135 | 205 | 226 | 3.0 | Front Wheel Drive | 7 | 43050.0 | 34035.000000 |
| 101 | Mercedes | 60.0 | 12.0 | 160 | 205 | 293 | 1.0 | Front Wheel Drive | 7 | 68949.0 | 44560.078947 |
| 100 | Mercedes | 60.0 | 12.0 | 160 | 205 | 293 | 1.0 | Front Wheel Drive | 7 | 68056.0 | 44560.078947 |
| 205 | Fiat | 68.0 | 13.3 | 130 | 265 | 257 | 2.0 | Front Wheel Drive | 7 | 61990.0 | 6429.000000 |
least_seats
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Smart | 16.7 | 11.6 | 130 | 100 | 167 | 2.25 | Rear Wheel Drive | 2 | 21940.0 | 19250.0 |
| 10 | Smart | 16.7 | 11.9 | 130 | 95 | 176 | 2.25 | Rear Wheel Drive | 2 | 25200.0 | 19250.0 |
# Most and least expensive cars in the UK based on 'PriceinUK'
most_expensive_uk = df.sort_values(by='PriceinUK', ascending=False).head(1)
least_expensive_uk = df.sort_values(by='PriceinUK').head(1)
most_expensive_uk
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 233 | Rolls-Royce | 100.0 | 4.5 | 250 | 455 | 220 | 5.0 | All Wheel Drive | 4 | 400000.0 | 350000.0 |
least_expensive_uk
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 207 | Fiat | 68.0 | 13.3 | 130 | 260 | 262 | 2.0 | Front Wheel Drive | 7 | 62990.0 | 6429.0 |
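The sort-then-`head(1)` pattern used for these extremes can also be written with `DataFrame.nlargest` and `DataFrame.nsmallest`, which name the direction explicitly and make it harder to mix up ascending and descending order. The sketch below uses a small sample built from three rows shown in this notebook.

```python
import pandas as pd

# Three rows taken from the notebook's outputs, keeping their original index labels.
sample = pd.DataFrame({
    'Brand': ['Maserati', 'Citroen', 'Tesla'],
    'TopSpeed_km/h': [320, 130, 282],
}, index=[270, 127, 52])

# nlargest/nsmallest return the n rows with the highest/lowest value in the column.
fastest_car = sample.nlargest(1, 'TopSpeed_km/h')
slowest_car = sample.nsmallest(1, 'TopSpeed_km/h')

print(fastest_car)
print(slowest_car)
```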
plt.figure(figsize=(8,6))
sns.heatmap(df.corr(numeric_only=True), annot=True) # numeric_only=True skips the text columns ('Brand', 'drive')
1) Battery capacity is positively correlated with range (0.89) and with `efficiency_wh/km` (0.62). Cars with larger batteries travel further on a single charge. Because `efficiency_wh/km` measures energy consumed per kilometre, the positive value means they also tend to consume more energy per kilometre.
2) `accln_sec` is negatively correlated with range (-0.75) and with `efficiency_wh/km` (-0.87). A lower `accln_sec` means quicker acceleration, so quicker cars tend to have longer ranges and higher consumption per kilometre.
3) Top speed is negatively correlated with range (-0.50) and with `efficiency_wh/km` (-0.75). Read literally, cars with higher top speeds tend to have shorter ranges and lower consumption per kilometre.
4) Price is positively correlated with battery capacity (0.62), range (0.51), `efficiency_wh/km` (0.75) and top speed (0.72). More expensive cars typically have larger batteries, longer ranges, higher top speeds and higher consumption per kilometre.
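When reading the heatmap, the sign of each coefficient depends on how the column is defined: `accln_sec` is a time, so smaller means quicker, and `efficiency_wh/km` is consumption, so smaller means more efficient. As a small sketch of this, the correlation below is computed on the five rows shown by `df.head()`; five rows are far too few for a reliable estimate and serve only to illustrate the sign.

```python
import pandas as pd

# The first five rows of df, restricted to three columns.
sample = pd.DataFrame({
    'Brand': ['Lucid', 'Porsche', 'Audi', 'Renault', 'Audi'],
    'accln_sec': [2.7, 2.8, 3.3, 11.4, 9.0],
    'TopSpeed_km/h': [270, 260, 250, 135, 160],
})

# Keep only numeric columns before computing the correlation matrix.
corr = sample.select_dtypes(include='number').corr()
print(corr.round(2))
```

The coefficient is strongly negative even though quicker cars are also the faster ones, because a quick car has a small `accln_sec`.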
df.columns
Index(['Brand', 'Batery_KWh', 'accln_sec', 'TopSpeed_km/h', 'range_km',
'efficiency_wh/km', 'ChargeSpeed_km/hr', 'drive', 'NumberofSeats',
'PriceinGermany', 'PriceinUK'],
dtype='object')
ax=df.Brand.value_counts() #visualising the available options to choose from
plt.figure(figsize=(12,5))
bx=sns.barplot(x=ax.index, y=ax.values)
plt.title('Number of electric cars offered by each company')
plt.xlabel('Car Brand')
plt.ylabel('Count')
plt.xticks(rotation=90)
for bars in bx.containers:
    bx.bar_label(bars)
Mercedes offers the largest number of electric vehicles in this dataset.
ax=df.drive.value_counts() #Distribution of drive systems
plt.figure(figsize=(4, 4))
plt.pie(ax, labels=ax.index, autopct='%1.1f%%', startangle=90)
plt.title('Popular Drive systems')
The majority of vehicles use all wheel drive or front wheel drive.
plt.figure(figsize=(7,5)) #visualizing the count of the drive system
ax=sns.countplot(x='drive', data=df)
plt.title('Drive system Distribution and count')
plt.xlabel('drive system')
plt.ylabel('Count')
plt.xticks(rotation=0)
for bars in ax.containers:
    ax.bar_label(bars)
plt.figure(figsize=(10,18)) #
ax=sns.countplot(y='Brand', hue='drive', data=df)
plt.title('Drive system used by manufacturers')
plt.ylabel('Car Brand')
plt.xlabel('Count')
plt.xticks(rotation=0)
for bars in ax.containers:
    ax.bar_label(bars)
plt.figure(figsize=(7,5)) #
ax=sns.countplot(x='NumberofSeats', data=df)
plt.title('Seat distribution')
plt.xlabel('seats')
plt.ylabel('Count')
plt.xticks(rotation=0)
for bars in ax.containers:
    ax.bar_label(bars)
The majority of vehicles have 5 seats.
plt.figure(figsize=(7,5)) #
ax=sns.countplot(x='NumberofSeats', hue='drive', data=df)
plt.title('Seat distribution and drive')
plt.xlabel('seats')
plt.ylabel('Count')
plt.xticks(rotation=0)
for bars in ax.containers:
    ax.bar_label(bars)
1) The majority of vehicles with 5 seats use all wheel drive.
2) Front wheel drive is more common in 7-seater vehicles.
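The counts behind a grouped count plot like this can also be read off a contingency table. The sketch below uses `pd.crosstab` on hypothetical toy rows; only the column names `NumberofSeats` and `drive` follow the notebook's `df`.

```python
import pandas as pd

# Hypothetical rows; column names match the notebook's df.
toy = pd.DataFrame({
    'NumberofSeats': [5, 5, 5, 7, 7, 4],
    'drive': ['All Wheel Drive', 'All Wheel Drive', 'Front Wheel Drive',
              'Front Wheel Drive', 'Front Wheel Drive', 'Rear Wheel Drive'],
})

# Rows are seat counts, columns are drive systems, cells are vehicle counts.
counts = pd.crosstab(toy['NumberofSeats'], toy['drive'])
print(counts)
```

On the real data, `pd.crosstab(df['NumberofSeats'], df['drive'])` would give the exact numbers shown as bar labels.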
# Checking whether brand value has an effect on cost.
brand_cost=df.groupby(['Brand'],as_index=False)['PriceinGermany'].mean().sort_values(by='PriceinGermany',ascending=False)
brand_cost.head()
| | Brand | PriceinGermany |
|---|---|---|
| 30 | Rolls-Royce | 400000.000000 |
| 19 | Maserati | 187500.000000 |
| 28 | Porsche | 130015.944444 |
| 17 | Lotus | 123490.000000 |
| 2 | BMW | 95820.000000 |
sns.relplot(y="Brand", x="PriceinGermany", height=10, aspect=0.8 ,data=brand_cost)
plt.title('Average cost of an EV for each brand')
plt.xlabel('Average price in Germany')
plt.ylabel('Brand')
# Checking whether brand value has an effect on cost in the UK.
brand_cost_uk =df.groupby(['Brand'],as_index=False)['PriceinUK'].mean().sort_values(by='PriceinUK',ascending=False)
brand_cost_uk.head()
| | Brand | PriceinUK |
|---|---|---|
| 30 | Rolls-Royce | 350000.000000 |
| 19 | Maserati | 167500.000000 |
| 17 | Lotus | 104750.000000 |
| 28 | Porsche | 102628.333333 |
| 2 | BMW | 86985.000000 |
sns.relplot(y="Brand", x="PriceinUK", height=10, aspect=0.8 ,data=brand_cost_uk)
plt.title('Average cost of an EV for each brand')
plt.xlabel('Average price in UK')
plt.ylabel('Brand')
# Depending on usage, there may be a relation between the battery capacity and the acceleration of the vehicle.
sns.relplot(x="Batery_KWh", y="accln_sec", height=6,aspect=2,hue="drive",data=df)
plt.title('Battery capacity vs acceleration')
plt.xlabel('Battery capacity (kwh)')
plt.ylabel('acceleration (sec)')
plt.xticks(rotation=0)
All wheel drive cars tend to have large batteries and very quick acceleration.
sns.relplot(x="Batery_KWh", y="accln_sec", height=6,aspect=2,hue="NumberofSeats",data=df,palette=["red", "green", "blue", "black"])
plt.title('Battery capacity vs acceleration')
plt.xlabel('Battery capacity (kwh)')
plt.ylabel('acceleration (sec)')
plt.xticks(rotation=0)
The majority of cars are 5-seaters. The 5-seaters that accelerate quickly mostly use an all wheel drive system.
# Relation of battery capacity and top speed w.r.t. number of seats
sns.relplot(x="Batery_KWh", y="TopSpeed_km/h", height=6,aspect=2,hue="NumberofSeats",data=df,palette=["red", "green", "blue", "black"])
plt.title('Battery capacity vs top speed')
plt.xlabel('Battery capacity (kwh)')
plt.ylabel('Top speed (km/h)')
plt.xticks(rotation=0)
#relation of battery and top speed w.r.t. drive
sns.relplot(x="Batery_KWh", y="TopSpeed_km/h", height=6,aspect=2,hue="drive",data=df,)
plt.title('Battery capacity vs top speed')
plt.xlabel('Battery capacity (kwh)')
plt.ylabel('Top speed (km/h)')
plt.xticks(rotation=0)
(array([ 0., 20., 40., 60., 80., 100., 120.]), [Text(0.0, 0, '0'), Text(20.0, 0, '20'), Text(40.0, 0, '40'), Text(60.0, 0, '60'), Text(80.0, 0, '80'), Text(100.0, 0, '100'), Text(120.0, 0, '120')])
All-wheel-drive cars are mostly 5-seaters and perform better in terms of battery capacity and top speed.
# Speed vs range graph
sns.relplot(x="TopSpeed_km/h", y="range_km",height=6, hue="drive",data=df)
plt.title('top speed vs range')
plt.xlabel('top speed (km/h)')
plt.ylabel('range (km)')
# Speed vs range graph
sns.relplot(x="TopSpeed_km/h", y="range_km",height=6, hue="NumberofSeats",palette=["red", "green", "blue", "black"],data=df)
plt.title('top speed vs range')
plt.xlabel('top speed (km/h)')
plt.ylabel('range (km)')
All-wheel-drive EVs perform better in terms of range and top speed.
# Usually fast cars have higher price range so we will plot a graph for top speed and price
sns.relplot(x="TopSpeed_km/h", y="PriceinGermany",height=6, hue="drive",data=df)
plt.title('top speed vs price')
plt.xlabel('top speed (km/h)')
plt.ylabel('Price in Germany')
# Usually fast cars have higher price range so we will plot a graph for top speed and price
sns.relplot(x="TopSpeed_km/h", y="PriceinGermany",height=6, hue="NumberofSeats",data=df, palette=["red", "green", "blue", "black"])
plt.title('top speed vs price')
plt.xlabel('top speed (km/h)')
plt.ylabel('Price in Germany')
The faster the car, the higher the price.
import plotly.express as px # For 3d visualization
fig = px.scatter_3d(df, x='accln_sec', y='efficiency_wh/km', z='PriceinGermany', color='Brand', height=800, width=800)
fig.update_layout(title='3D scatter plot of electric vehicles')
fig.show()
Price, acceleration and efficiency are major factors a customer considers when buying a car. Mercedes offers a wide selection: its line-up includes the most economical car as well as cars that are fast and fun to drive but somewhat less efficient.
df.head()
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Porsche | 83.7 | 2.8 | 260 | 400 | 209 | 9.0 | All Wheel Drive | 4 | 189668.000 | 142400.000 |
| 2 | Audi | 85.0 | 3.3 | 250 | 405 | 210 | 1.0 | All Wheel Drive | 4 | 146050.000 | 115000.000 |
| 3 | Renault | 52.0 | 11.4 | 135 | 315 | 165 | 2.0 | Front Wheel Drive | 5 | 36840.000 | 9570.000 |
| 4 | Audi | 52.0 | 9.0 | 160 | 285 | 182 | 3.0 | Rear Wheel Drive | 5 | 70905.625 | 56860.625 |
| 5 | Tesla | 75.0 | 3.7 | 250 | 415 | 181 | 6.0 | All Wheel Drive | 5 | 63667.000 | 59990.000 |
df.drive.unique()
array(['All Wheel Drive', 'Front Wheel Drive', 'Rear Wheel Drive'],
dtype=object)
For the predictive model we convert the drive column to numeric values with the pandas map function. The mapping is as follows: Front Wheel Drive: 1, Rear Wheel Drive: 2, All Wheel Drive: 3.
# Define the mapping dictionary
drive_mapping = {'Front Wheel Drive': 1,'Rear Wheel Drive': 2, 'All Wheel Drive': 3 }
# Use the map function to create a new column with the mapped values
df['drive'] = df['drive'].map(drive_mapping)
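A small sanity check can accompany the mapping step. `Series.map` returns NaN for any label that is missing from the dictionary, so counting NaNs afterwards catches misspelled or unexpected categories. The sketch below is only an illustration: the `toy` Series and its values are made up and are not taken from the dataset.

```python
import pandas as pd

drive_mapping = {'Front Wheel Drive': 1, 'Rear Wheel Drive': 2, 'All Wheel Drive': 3}

# Toy data for illustration only; the last label is deliberately not in the mapping.
toy = pd.Series(['All Wheel Drive', 'Front Wheel Drive', 'Rear Wheel Drive', '4x4'])

mapped = toy.map(drive_mapping)

# Labels absent from the dictionary become NaN, so this count should be 0 on clean data.
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print('unmapped labels:', n_unmapped)
```

On the real column the same count would be expected to be 0, since `df.drive.unique()` lists exactly the three mapped labels.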
df.head() # checking the mapped drive column
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Porsche | 83.7 | 2.8 | 260 | 400 | 209 | 9.0 | 3 | 4 | 189668.000 | 142400.000 |
| 2 | Audi | 85.0 | 3.3 | 250 | 405 | 210 | 1.0 | 3 | 4 | 146050.000 | 115000.000 |
| 3 | Renault | 52.0 | 11.4 | 135 | 315 | 165 | 2.0 | 1 | 5 | 36840.000 | 9570.000 |
| 4 | Audi | 52.0 | 9.0 | 160 | 285 | 182 | 3.0 | 2 | 5 | 70905.625 | 56860.625 |
| 5 | Tesla | 75.0 | 3.7 | 250 | 415 | 181 | 6.0 | 3 | 5 | 63667.000 | 59990.000 |
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Batery_KWh | 276.0 | 69.996377 | 20.652600 | 16.7 | 52.00 | 70.5 | 83.700 | 108.4 |
| accln_sec | 276.0 | 7.691667 | 3.058515 | 2.1 | 5.10 | 7.3 | 9.425 | 14.3 |
| TopSpeed_km/h | 276.0 | 179.021739 | 38.784058 | 130.0 | 150.00 | 172.5 | 205.000 | 320.0 |
| range_km | 276.0 | 354.891304 | 107.165068 | 95.0 | 265.00 | 365.0 | 431.250 | 640.0 |
| efficiency_wh/km | 276.0 | 200.137681 | 33.484801 | 150.0 | 174.00 | 192.0 | 217.000 | 295.0 |
| ChargeSpeed_km/hr | 276.0 | 4.196429 | 2.219032 | 1.0 | 2.00 | 4.0 | 6.000 | 9.0 |
| drive | 276.0 | 2.000000 | 0.882146 | 1.0 | 1.00 | 2.0 | 3.000 | 3.0 |
| NumberofSeats | 276.0 | 5.304348 | 1.009833 | 2.0 | 5.00 | 5.0 | 5.000 | 7.0 |
| PriceinGermany | 276.0 | 68599.975738 | 39872.583226 | 21940.0 | 44657.25 | 57195.0 | 71849.250 | 400000.0 |
| PriceinUK | 276.0 | 53853.342830 | 36782.271092 | 6429.0 | 34758.75 | 45477.5 | 61321.250 | 350000.0 |
We will create a new data frame, df1, to store the values after removing the outliers.
A value is treated as an outlier if it falls below Q1 - 1.5 × IQR (the lower limit) or above Q3 + 1.5 × IQR (the upper limit), where IQR = Q3 - Q1.
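As a worked example of the rule, the summary table gives Q1 = 52.0 and Q3 = 83.7 for Batery_KWh, so IQR = 31.7 and the limits are 52.0 - 47.55 = 4.45 and 83.7 + 47.55 = 131.25. The rule can also be wrapped in a small helper, sketched below under a few assumptions: the function names `iqr_bounds` and `drop_outliers` are hypothetical, and the toy frame is made up for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(frame, column, k=1.5):
    """Keep only rows whose value in `column` lies within the IQR limits.

    Note: `between` is inclusive and evaluates to False for NaN,
    so rows with a missing value in `column` are dropped as well.
    """
    lower, upper = iqr_bounds(frame[column], k)
    return frame[frame[column].between(lower, upper)]

# Toy frame for illustration only (made-up values with one obvious outlier).
toy = pd.DataFrame({'x': [10, 12, 11, 13, 12, 11, 100]})
lower, upper = iqr_bounds(toy['x'])
cleaned = drop_outliers(toy, 'x')
print(lower, upper, cleaned.shape)
```

Because the helper always computes the mask from the frame it is given, the mask and the frame share one index, which avoids index-alignment surprises when several columns are filtered in sequence.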
df.columns
Index(['Brand', 'Batery_KWh', 'accln_sec', 'TopSpeed_km/h', 'range_km',
'efficiency_wh/km', 'ChargeSpeed_km/hr', 'drive', 'NumberofSeats',
'PriceinGermany', 'PriceinUK'],
dtype='object')
# Graphically analysing the distribution of data
plt.figure(figsize=(16,5))
plt.subplot(1,3,1)
sns.distplot(df['Batery_KWh'])
plt.subplot(1,3,2)
sns.distplot(df['accln_sec'])
plt.subplot(1,3,3)
sns.distplot(df['TopSpeed_km/h'])
UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
<Axes: xlabel='TopSpeed_km/h', ylabel='Density'>
plt.figure(figsize=(16,5))
plt.subplot(1,3,1)
sns.distplot(df['range_km'])
plt.subplot(1,3,2)
sns.distplot(df['efficiency_wh/km'])
plt.subplot(1,3,3)
sns.distplot(df['ChargeSpeed_km/hr'])
<Axes: xlabel='ChargeSpeed_km/hr', ylabel='Density'>
plt.figure(figsize=(16,5))
plt.subplot(1,3,1)
sns.distplot(df['NumberofSeats'])
plt.subplot(1,3,2)
sns.distplot(df['PriceinGermany'])
plt.subplot(1,3,3)
sns.distplot(df['PriceinUK'])
<Axes: xlabel='PriceinUK', ylabel='Density'>
plt.figure(figsize=(16,10))
plt.subplot(3,2,1)
sns.distplot(df['drive'])
<Axes: xlabel='drive', ylabel='Density'>
Upon careful examination, we found that the data is not normally distributed.
df1=df.copy() #creating a new dataframe as a copy, so that df itself stays unchanged
# Batery_KWh
q1=df1['Batery_KWh'].quantile(0.25)
q3=df1['Batery_KWh'].quantile(0.75)
iqr=q3-q1
print(iqr)
ll=q1-1.5*iqr
ul=q3+1.5*iqr
df1=df1[~((df1['Batery_KWh']<ll) | (df1['Batery_KWh']>ul))]
df1.shape
31.700000000000003
(276, 11)
# accln_sec
q1=df1['accln_sec'].quantile(0.25)
q3=df1['accln_sec'].quantile(0.75)
iqr=q3-q1
print(iqr)
ll=q1-1.5*iqr
ul=q3+1.5*iqr
df1=df1[~((df1['accln_sec']<ll) | (df1['accln_sec']>ul))]
df1.shape
4.325000000000001
(276, 11)
# TopSpeed_km/h
q1=df1['TopSpeed_km/h'].quantile(0.25)
q3=df1['TopSpeed_km/h'].quantile(0.75)
iqr=q3-q1
print(iqr)
ll=q1-1.5*iqr
ul=q3+1.5*iqr
df1=df1[~((df1['TopSpeed_km/h']<ll) | (df1['TopSpeed_km/h']>ul))]
df1.shape
55.0
(275, 11)
# range_km
q1=df1['range_km'].quantile(0.25)
q3=df1['range_km'].quantile(0.75)
iqr=q3-q1
print(iqr)
ll=q1-1.5*iqr
ul=q3+1.5*iqr
df1=df1[~((df1['range_km']<ll) | (df1['range_km']>ul))]
df1.shape
167.5
(275, 11)
# efficiency_wh/km
q1=df1['efficiency_wh/km'].quantile(0.25)
q3=df1['efficiency_wh/km'].quantile(0.75)
iqr=q3-q1
print(iqr)
ll=q1-1.5*iqr
ul=q3+1.5*iqr
df1=df1[~((df1['efficiency_wh/km']<ll) | (df1['efficiency_wh/km']>ul))]
df1.shape
43.0
(267, 11)
# ChargeSpeed_km/hr
q1=df1['ChargeSpeed_km/hr'].quantile(0.25)
q3=df1['ChargeSpeed_km/hr'].quantile(0.75)
iqr=q3-q1
print(iqr)
ll=q1-1.5*iqr
ul=q3+1.5*iqr
df1=df1[~((df1['ChargeSpeed_km/hr']<ll) | (df1['ChargeSpeed_km/hr']>ul))]
df1.shape
4.0
(267, 11)
# PriceinGermany
q1=df1['PriceinGermany'].quantile(0.25)
q3=df1['PriceinGermany'].quantile(0.75)
iqr=q3-q1
print(iqr)
ll=q1-1.5*iqr
ul=q3+1.5*iqr
df1=df1[~((df1['PriceinGermany']<ll) | (df1['PriceinGermany']>ul))]
df1.shape
27995.0
(237, 11)
# PriceinUK
q1=df1['PriceinUK'].quantile(0.25)
q3=df1['PriceinUK'].quantile(0.75)
iqr=q3-q1
print(iqr)
ll=q1-1.5*iqr
ul=q3+1.5*iqr
df1=df1[~((df1['PriceinUK']<ll) | (df1['PriceinUK']>ul))]
df1.shape
23010.0
(228, 11)
print('Total number of rows removed', df.shape[0] - df1.shape[0])
print('There were originally', df.shape[0] ,'rows present.')
Total number of rows removed 48
There were originally 276 rows present.
plt.figure(figsize=(12,3))
plt.subplot(1,2,1)
sns.distplot(df['Batery_KWh'])
plt.subplot(1,2,2)
sns.distplot(df1['Batery_KWh'])
<Axes: xlabel='Batery_KWh', ylabel='Density'>
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
sns.distplot(df['accln_sec'])
plt.subplot(2,2,2)
sns.distplot(df1['accln_sec'])
<Axes: xlabel='accln_sec', ylabel='Density'>
plt.figure(figsize=(10,6))
plt.subplot(2,2,3)
sns.distplot(df['TopSpeed_km/h'])
plt.subplot(2,2,4)
sns.distplot(df1['TopSpeed_km/h'])
<Axes: xlabel='TopSpeed_km/h', ylabel='Density'>
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
sns.distplot(df['range_km'])
plt.subplot(2,2,2)
sns.distplot(df1['range_km'])
<Axes: xlabel='range_km', ylabel='Density'>
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
sns.distplot(df['efficiency_wh/km'])
plt.subplot(2,2,2)
sns.distplot(df1['efficiency_wh/km'])
<Axes: xlabel='efficiency_wh/km', ylabel='Density'>
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
sns.distplot(df['ChargeSpeed_km/hr'])
plt.subplot(2,2,2)
sns.distplot(df1['ChargeSpeed_km/hr'])
<Axes: xlabel='ChargeSpeed_km/hr', ylabel='Density'>
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
sns.distplot(df['drive'])
plt.subplot(2,2,2)
sns.distplot(df1['drive'])
<Axes: xlabel='drive', ylabel='Density'>
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
sns.distplot(df['PriceinGermany'])
plt.subplot(2,2,2)
sns.distplot(df1['PriceinGermany'])
<Axes: xlabel='PriceinGermany', ylabel='Density'>
On comparison, we found that the distributions are closer to a bell curve after the outliers have been removed.
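The visual comparison can be complemented with a number. One option, which is not used in the original analysis, is the sample skewness from `Series.skew()`: values near 0 indicate a roughly symmetric distribution. The sketch below uses made-up toy values rather than the EV data, so it only illustrates the idea.

```python
import pandas as pd

# Toy right-skewed data for illustration only; not taken from the EV dataset.
before = pd.Series([20, 22, 23, 25, 26, 28, 30, 120])
# Same data with the extreme value removed, mimicking the outlier filter.
after = before[before < 100]

skew_before = before.skew()
skew_after = after.skew()
print(round(skew_before, 2), round(skew_after, 2))
```

Applied to the notebook's frames, the same comparison would be `df[col].skew()` against `df1[col].skew()` for each numeric column.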
df1.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Batery_KWh | 228.0 | 65.346930 | 18.823079 | 16.7 | 50.800 | 68.000000 | 77.00 | 108.4 |
| accln_sec | 228.0 | 8.183333 | 2.805954 | 3.3 | 6.175 | 7.850000 | 9.75 | 14.3 |
| TopSpeed_km/h | 228.0 | 169.587719 | 31.641803 | 130.0 | 150.000 | 160.000000 | 185.00 | 261.0 |
| range_km | 228.0 | 338.618421 | 101.666506 | 95.0 | 260.000 | 347.500000 | 410.00 | 615.0 |
| efficiency_wh/km | 228.0 | 195.907895 | 31.126505 | 150.0 | 172.750 | 188.500000 | 207.00 | 262.0 |
| ChargeSpeed_km/hr | 228.0 | 3.987782 | 2.021628 | 1.0 | 2.000 | 4.000000 | 5.00 | 9.0 |
| drive | 228.0 | 1.864035 | 0.846826 | 1.0 | 1.000 | 2.000000 | 3.00 | 3.0 |
| NumberofSeats | 228.0 | 5.298246 | 0.965756 | 2.0 | 5.000 | 5.000000 | 5.00 | 7.0 |
| PriceinGermany | 228.0 | 55550.167999 | 17346.996616 | 21940.0 | 42330.000 | 53707.000000 | 63300.00 | 113359.0 |
| PriceinUK | 228.0 | 42573.275923 | 19033.912698 | 6429.0 | 31870.000 | 44131.150585 | 53680.00 | 89500.0 |
pip install scikit-learn --upgrade
Requirement already satisfied: scikit-learn in c:\users\prati\anaconda3\lib\site-packages (1.3.2)
Requirement already satisfied: numpy<2.0,>=1.17.3 in c:\users\prati\anaconda3\lib\site-packages (from scikit-learn) (1.23.5)
Requirement already satisfied: scipy>=1.5.0 in c:\users\prati\anaconda3\lib\site-packages (from scikit-learn) (1.10.0)
Requirement already satisfied: joblib>=1.1.1 in c:\users\prati\anaconda3\lib\site-packages (from scikit-learn) (1.1.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\prati\anaconda3\lib\site-packages (from scikit-learn) (2.2.0)
Note: you may need to restart the kernel to use updated packages.
pip install xgboost
Requirement already satisfied: xgboost in c:\users\prati\anaconda3\lib\site-packages (2.0.1)
Requirement already satisfied: numpy in c:\users\prati\anaconda3\lib\site-packages (from xgboost) (1.23.5)
Requirement already satisfied: scipy in c:\users\prati\anaconda3\lib\site-packages (from xgboost) (1.10.0)
Note: you may need to restart the kernel to use updated packages.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, f1_score
NOTE: For accuracy and efficiency, only the price in the German market has been considered.
df1.columns
Index(['Brand', 'Batery_KWh', 'accln_sec', 'TopSpeed_km/h', 'range_km',
'efficiency_wh/km', 'ChargeSpeed_km/hr', 'drive', 'NumberofSeats',
'PriceinGermany', 'PriceinUK'],
dtype='object')
# Define features and target variables
X = df1.drop(columns=['PriceinGermany', 'PriceinUK'])
y = df1[['PriceinGermany']]
X = X.drop(columns=['Brand']) #brand value may be a factor affecting price but it is not the primary factor hence can be ignored here
# train test split
#since data is not very large, only 20 % is taken as test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=92)
print(X_train.shape,y_train.shape)
(182, 8) (182, 1)
print(X_test.shape,y_test.shape)
(46, 8) (46, 1)
import statsmodels.formula.api as smf
import statsmodels.api as sm
model=sm.OLS(y_train,X_train).fit()
model.summary()
| Dep. Variable: | PriceinGermany | R-squared (uncentered): | 0.971 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.970 |
| Method: | Least Squares | F-statistic: | 726.3 |
| Date: | Sat, 28 Oct 2023 | Prob (F-statistic): | 2.29e-129 |
| Time: | 00:12:15 | Log-Likelihood: | -1932.9 |
| No. Observations: | 182 | AIC: | 3882. |
| Df Residuals: | 174 | BIC: | 3907. |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Batery_KWh | 1210.5890 | 213.590 | 5.668 | 0.000 | 789.028 | 1632.150 |
| accln_sec | -162.0714 | 675.417 | -0.240 | 0.811 | -1495.136 | 1170.993 |
| TopSpeed_km/h | 283.0061 | 47.827 | 5.917 | 0.000 | 188.611 | 377.401 |
| range_km | -161.0347 | 41.262 | -3.903 | 0.000 | -242.474 | -79.596 |
| efficiency_wh/km | -4.9222 | 81.163 | -0.061 | 0.952 | -165.112 | 155.268 |
| ChargeSpeed_km/hr | -790.3493 | 513.228 | -1.540 | 0.125 | -1803.303 | 222.605 |
| drive | -1439.0378 | 1563.245 | -0.921 | 0.359 | -4524.401 | 1646.325 |
| NumberofSeats | -1706.8887 | 1432.278 | -1.192 | 0.235 | -4533.764 | 1119.987 |
| Omnibus: | 40.606 | Durbin-Watson: | 2.274 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 102.789 |
| Skew: | 0.943 | Prob(JB): | 4.78e-23 |
| Kurtosis: | 6.162 | Cond. No. | 1.02e+03 |
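The summary reports an uncentered R² because `sm.OLS` does not add an intercept on its own and `X_train` has no constant column. Uncentered R² compares the residuals with the raw sum of squares of `y` instead of its variation around the mean, so it is never lower and often much higher than the usual R², and it is not directly comparable with the `score` values of the scikit-learn models. The NumPy sketch below illustrates the difference on synthetic data (all numbers are made up); in statsmodels, an intercept is usually added with `sm.add_constant`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only: a large offset plus a weak linear signal.
x = rng.uniform(1.0, 10.0, size=200)
y = 50_000 + 500 * x + rng.normal(0.0, 5_000, size=200)

# Least-squares fit without an intercept column (regression through the origin).
X = x.reshape(-1, 1)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
ss_res = np.sum(residuals ** 2)

# Uncentered R^2: residuals compared with the raw sum of squares of y.
r2_uncentered = 1 - ss_res / np.sum(y ** 2)
# Centered R^2: residuals compared with the variation of y around its mean.
r2_centered = 1 - ss_res / np.sum((y - y.mean()) ** 2)

print(f"uncentered R^2: {r2_uncentered:.3f}")
print(f"centered R^2:   {r2_centered:.3f}")
```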
# Regression models
random_forest = RandomForestRegressor()
decision_tree = DecisionTreeRegressor()
xg_boost = XGBRegressor()
lin_reg = LinearRegression()
mypipeline= [random_forest, decision_tree, xg_boost, lin_reg]
# training the models
for pipe in mypipeline:
    pipe.fit(X_train, y_train.values.ravel())  # ravel() passes y as a 1-D array, which scikit-learn expects
PipelineDict = {0: 'Random Forest', 1: 'Decision Tree', 2:'XG Boost', 3: 'Linear Regression'}
# For regressors, score() returns the R^2 on the given data, not a classification accuracy
for i,model in enumerate(mypipeline):
    print('{} TestAccuracy:{}'.format(PipelineDict[i],model.score(X_test,y_test)))
Random Forest TestAccuracy:0.8941580637830726
Decision Tree TestAccuracy:0.9178649563850655
XG Boost TestAccuracy:0.9160108790498399
Linear Regression TestAccuracy:0.7374683649530456
# pick the model with the highest test score
accuracy=0
classifier=0
pipeline=""
for i, model in enumerate(mypipeline):
    score=model.score(X_test,y_test)
    if score>accuracy:
        accuracy=score
        pipeline=model
        classifier=i
print('classifier with best accuracy:{}'.format(PipelineDict[classifier]))
classifier with best accuracy:Decision Tree
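The same selection can be written more compactly with `max`. The sketch below copies the four test scores from the printed output above into a dictionary (the `scores` and `best_name` names are illustrative) and picks the key with the largest value. Unlike a loop that starts from 0, this also behaves sensibly if every score happens to be negative.

```python
# Test scores copied from the printed output above (R^2 on the test split).
scores = {
    'Random Forest': 0.8941580637830726,
    'Decision Tree': 0.9178649563850655,
    'XG Boost': 0.9160108790498399,
    'Linear Regression': 0.7374683649530456,
}

# max() with key=scores.get returns the name with the highest score.
best_name = max(scores, key=scores.get)
print(f'best model: {best_name} (R^2 = {scores[best_name]:.3f})')
```

The gap between the decision tree and XGBoost is only about 0.002, and the models are created without a fixed `random_state`, so the ranking may change between runs.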
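Since decision_tree has the highest test score, we will proceed with the decision tree model to calculate the error metrics (MSE, MAE, R2 and RMSE). The F1 score is a classification metric and does not apply to this regression task.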
decision_tree.score(X_test,y_test)
0.9178649563850655
predictions = decision_tree.predict(X_test) #prediction
predictions
array([ 59200. , 36400. , 67300. , 44668. ,
70800. , 47595. , 95990. , 108867. ,
54430. , 53000. , 57940. , 57000. ,
59580. , 35700. , 42644.66666667, 94123. ,
69950. , 66069. , 33240. , 69950. ,
33240. , 61100. , 37900. , 39682.5 ,
55900. , 47564.875 , 47564.875 , 40000. ,
37840. , 95074. , 59128. , 55980. ,
36790. , 42145. , 40846.66666667, 55162.4 ,
36790. , 64530. , 54430. , 59950. ,
42900. , 36840. , 89351. , 40650. ,
54000. , 40150. ])
# Calculate MSE
mse = mean_squared_error(y_test, predictions)
mse
25085634.753916066
# Calculate MAE
mae = mean_absolute_error(y_test, predictions)
mae
2215.8916666666664
# Calculate R2 score
r2 = r2_score(y_test, predictions)
r2
0.9178649563850655
import math
# Calculate RMSE
rmse = math.sqrt(mse)
rmse
5008.5561546134295
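For reference, the four metrics are related in a simple way: RMSE is the square root of MSE (here, the square root of 25085634.75 is about 5008.56), and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The NumPy sketch below spells out the formulas on made-up toy values; it is an illustration, not a replacement for the scikit-learn functions used above.

```python
import numpy as np

# Toy values for illustration only; not taken from the EV dataset.
y_true = np.array([30_000.0, 45_000.0, 60_000.0, 80_000.0])
y_pred = np.array([32_000.0, 44_000.0, 57_000.0, 85_000.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                   # mean absolute error
mse = np.mean(errors ** 2)                      # mean squared error
rmse = np.sqrt(mse)                             # root of MSE, same units as the target
ss_res = np.sum(errors ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot                        # coefficient of determination

print(mae, mse, rmse, r2)
```

MAE and RMSE are in the units of the target (euros here), which makes them easier to interpret than MSE.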
df1.head()
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Renault | 52.0 | 11.4 | 135 | 315 | 165 | 2.0 | 1 | 5 | 36840.000 | 9570.000 |
| 4 | Audi | 52.0 | 9.0 | 160 | 285 | 182 | 3.0 | 2 | 5 | 70905.625 | 56860.625 |
| 5 | Tesla | 75.0 | 3.7 | 250 | 415 | 181 | 6.0 | 3 | 5 | 63667.000 | 59990.000 |
| 6 | Porsche | 83.7 | 5.1 | 220 | 425 | 197 | 1.0 | 3 | 4 | 98514.000 | 84500.000 |
| 7 | Renault | 52.0 | 9.5 | 140 | 310 | 168 | 2.0 | 1 | 5 | 37840.000 | 29995.000 |
y_predict = decision_tree.predict(pd.DataFrame([[52.0,11.4,135,315,165,2.0,1,5]], columns=X.columns)) #taking the 1st value from df1.head(); named columns avoid the feature-name warning
y_predict
array([36840.])
df1.NumberofSeats.value_counts()
5    156
7     48
4     22
2      2
Name: NumberofSeats, dtype: int64
case1.head()
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | JAC | 39.0 | 12.0 | 132 | 225 | 173 | 1.0 | Front Wheel Drive | 5 | 0.0 | 0.0 |
| 214 | Hongqi | 76.5 | 6.5 | 200 | 305 | 251 | 4.0 | All Wheel Drive | 5 | 0.0 | 0.0 |
| 215 | Hongqi | 90.0 | 4.9 | 200 | 355 | 254 | 4.0 | All Wheel Drive | 5 | 0.0 | 0.0 |
| 271 | Seres | 51.0 | 8.9 | 155 | 260 | 196 | 2.0 | Front Wheel Drive | 5 | 0.0 | 0.0 |
| 287 | XPENG | 82.7 | 6.7 | 200 | 500 | 165 | 7.0 | Rear Wheel Drive | 5 | 0.0 | 0.0 |
y_predict = decision_tree.predict(pd.DataFrame([[39.0,12.0,132,225,173,1.0,1,5]], columns=X.columns)) #taking the 1st value from case1.head(); named columns avoid the feature-name warning
y_predict
array([50832.22222222])
case2.head()
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lucid | 118.0 | 2.7 | 270 | 645 | 183 | 8.0 | All Wheel Drive | 5 | 218000.0 | 0.0 |
| 31 | Lucid | 112.0 | 3.2 | 270 | 665 | 168 | 8.0 | All Wheel Drive | 5 | 174500.0 | 0.0 |
| 32 | Lucid | 88.0 | 3.4 | 250 | 550 | 160 | 9.0 | All Wheel Drive | 5 | 120000.0 | 0.0 |
| 33 | Lucid | 88.0 | 4.2 | 200 | 560 | 157 | 9.0 | Rear Wheel Drive | 5 | 100000.0 | 0.0 |
| 122 | SsangYong | 56.0 | 8.5 | 156 | 290 | 193 | 3.0 | Front Wheel Drive | 5 | 40490.0 | 0.0 |
y_predict = decision_tree.predict(pd.DataFrame([[118.0,2.7,270,645,183,8.0,3,5]], columns=X.columns)) #taking the 1st value from case2.head(); named columns avoid the feature-name warning
y_predict
array([113359.])
This ML model can help:
df1.head()
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Renault | 52.0 | 11.4 | 135 | 315 | 165 | 2.0 | 1 | 5 | 36840.000 | 9570.000 |
| 4 | Audi | 52.0 | 9.0 | 160 | 285 | 182 | 3.0 | 2 | 5 | 70905.625 | 56860.625 |
| 5 | Tesla | 75.0 | 3.7 | 250 | 415 | 181 | 6.0 | 3 | 5 | 63667.000 | 59990.000 |
| 6 | Porsche | 83.7 | 5.1 | 220 | 425 | 197 | 1.0 | 3 | 4 | 98514.000 | 84500.000 |
| 7 | Renault | 52.0 | 9.5 | 140 | 310 | 168 | 2.0 | 1 | 5 | 37840.000 | 29995.000 |
# Define the reverse mapping dictionary
reverse_drive_mapping = {1: 'Front Wheel Drive', 2: 'Rear Wheel Drive', 3: 'All Wheel Drive'}
# Use the map function to restore the original values
df1['drive'] = df1['drive'].map(reverse_drive_mapping)
df1.head()
| | Brand | Batery_KWh | accln_sec | TopSpeed_km/h | range_km | efficiency_wh/km | ChargeSpeed_km/hr | drive | NumberofSeats | PriceinGermany | PriceinUK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Renault | 52.0 | 11.4 | 135 | 315 | 165 | 2.0 | Front Wheel Drive | 5 | 36840.000 | 9570.000 |
| 4 | Audi | 52.0 | 9.0 | 160 | 285 | 182 | 3.0 | Rear Wheel Drive | 5 | 70905.625 | 56860.625 |
| 5 | Tesla | 75.0 | 3.7 | 250 | 415 | 181 | 6.0 | All Wheel Drive | 5 | 63667.000 | 59990.000 |
| 6 | Porsche | 83.7 | 5.1 | 220 | 425 | 197 | 1.0 | All Wheel Drive | 4 | 98514.000 | 84500.000 |
| 7 | Renault | 52.0 | 9.5 | 140 | 310 | 168 | 2.0 | Front Wheel Drive | 5 | 37840.000 | 29995.000 |
df1.to_csv(r"S:\BIA Capstone/EV_dataset_final.csv") # raw string so that the backslash is not read as an escape sequence